
Node metrics #948

Status: Open. Wants to merge 35 commits into base: master.
Conversation

@cody-littley (Contributor) commented Dec 3, 2024

Why are these changes needed?

Adds metrics to the v2 DA node.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@cody-littley cody-littley self-assigned this Dec 3, 2024
@cody-littley cody-littley marked this pull request as ready for review December 6, 2024 16:48
@cody-littley cody-littley marked this pull request as draft December 6, 2024 17:50
@cody-littley cody-littley marked this pull request as ready for review December 6, 2024 18:24
node/config.go (resolved)
@@ -101,6 +110,8 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
        return
    }

    s.metrics.ReportStoreChunksDataSize(size)

Contributor:
what if the store operation gets reverted in L125?

cody-littley (Contributor, Author):
As a general rule of thumb, should we report incremental metrics if the operation as a whole fails? Or should we only report metrics for an operation if it is successful? (in another PR, you suggested that I should report latencies even when there are failures).

I can make this only report if the request ends up being valid, but I want to be consistent with the way we handle scenarios like this.
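
To make the two options concrete, here is a minimal, self-contained sketch; fakeMetrics and the helper functions are hypothetical, and only ReportStoreChunksDataSize mirrors the metric method used in this PR:

package example

// Hypothetical sketch of the two reporting strategies being weighed here.
type fakeMetrics struct {
    reportedSizes []int64
}

func (m *fakeMetrics) ReportStoreChunksDataSize(size int64) {
    m.reportedSizes = append(m.reportedSizes, size)
}

// reportEagerly records the size before validation, so requests that are
// later rejected still contribute to the metric.
func reportEagerly(m *fakeMetrics, size int64, validateAndStore func() error) error {
    m.ReportStoreChunksDataSize(size)
    return validateAndStore()
}

// reportOnSuccess records the size only if the operation as a whole succeeds.
func reportOnSuccess(m *fakeMetrics, size int64, validateAndStore func() error) error {
    if err := validateAndStore(); err != nil {
        return err
    }
    m.ReportStoreChunksDataSize(size)
    return nil
}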

cody-littley (Contributor, Author):
As discussed offline, this will be left the way it currently is.

node/grpc/server_v2.go (outdated, resolved)

for m.isAlive.Load() {
    var size int64
    err := filepath.Walk(m.dbDir, func(_ string, info os.FileInfo, err error) error {

Contributor:
is this thread safe?

cody-littley (Contributor, Author) commented Dec 10, 2024:

It is almost certainly not thread safe (i.e. if levelDB deletes a file or a directory mid-walk, then filepath.Walk() will return an error). My hope was that if the race condition was sufficiently rare, we could still extract meaningful metrics data.

Currently, this will log an error whenever this method is unable to fetch new data. Will an error log cause problems if it triggers every once in a while? If so, should this be downgraded to a logger.info() call?

Unfortunately, levelDB doesn't expose an API that tells you the size of the DB (that I know of). My reasoning was that this metric would be sufficiently valuable to justify a hacky collection method.

In theory, we could have the levelDB wrapper track the quantity of data, at the cost of some extra bookkeeping (every DB modification would need to update a special size key-value pair); a rough sketch of this idea is shown below. This wouldn't tell us the size of the files on disk (which may vary depending on things like compaction and indexes), but would give us a very good idea of the approximate size of the DB. If I implemented such a thing, it would need to be in a standalone PR.

The final option would be to just delete this metric entirely. I'll defer to your judgement on this.
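
A rough, self-contained sketch of that bookkeeping idea; all names here are hypothetical, and a real version would wrap the node's levelDB handle rather than an in-memory map:

package example

import "sync"

// sizeTrackingStore keeps a running estimate of stored bytes so a metric
// can read it cheaply instead of walking the database directory on disk.
type sizeTrackingStore struct {
    mu        sync.Mutex
    data      map[string][]byte // stand-in for the underlying levelDB
    totalSize int64             // running estimate of stored key+value bytes
}

func newSizeTrackingStore() *sizeTrackingStore {
    return &sizeTrackingStore{data: make(map[string][]byte)}
}

func (s *sizeTrackingStore) Put(key string, value []byte) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if old, ok := s.data[key]; ok {
        s.totalSize -= int64(len(key) + len(old))
    }
    s.data[key] = value
    s.totalSize += int64(len(key) + len(value))
}

func (s *sizeTrackingStore) Delete(key string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if old, ok := s.data[key]; ok {
        s.totalSize -= int64(len(key) + len(old))
        delete(s.data, key)
    }
}

// ApproximateSize is what the metric would report.
func (s *sizeTrackingStore) ApproximateSize() int64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.totalSize
}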

cody-littley (Contributor, Author):

This metric is now removed.

node/grpc/v2_metrics.go (outdated, resolved)
node/grpc/server_v2.go (outdated, resolved)
node/flags/flags.go (outdated, resolved)
node/grpc/server_v2.go (outdated, resolved)
node/grpc/v2_metrics.go (outdated, resolved)

func (m *V2Metrics) ReportStoreChunksLatency(latency time.Duration) {
    m.storeChunksLatency.WithLabelValues().Observe(
        float64(latency.Nanoseconds()) / float64(time.Millisecond))
}

Contributor:
float64(latency.Milliseconds()) should work

cody-littley (Contributor, Author) commented Dec 12, 2024:

I'm intentionally not using that (@ian-shim made the same suggestion on another PR 😜). latency.Milliseconds() returns an int64, meaning we lose all sub-millisecond fidelity in the measurement. For many metrics this isn't a huge deal (e.g. if you are measuring something that takes hundreds of milliseconds), but for some of the things we are measuring, the precision is nice to have.

Let me know if you'd like to discuss this further.

Contributor:

I think it depends on the expected numbers we'll be dealing with here. If it's ultra-low latency and sub-ms precision matters, Microseconds() may be an option.

cody-littley (Contributor, Author):

For the sake of consistency, I've been converting everything to ms for reporting latencies.

Although I can often make an educated guess as to the expected latency for an operation, absent experimental data it's only a guess. If I guess wrong, we could end up with a unit that doesn't give us the resolution we actually need.

Would it be worthwhile to schedule a short call to discuss this?

cody-littley (Contributor, Author):

As discussed, all of these now use a utility method in common.
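
For reference, a minimal sketch of what such a shared helper could look like; the package and function name are illustrative, not necessarily the actual utility in common:

package common

import "time"

// ToMilliseconds converts a duration to fractional milliseconds so that
// sub-millisecond precision is preserved when observing a histogram.
func ToMilliseconds(d time.Duration) float64 {
    return float64(d.Nanoseconds()) / float64(time.Millisecond)
}

A call site would then read, e.g., m.storeChunksLatency.WithLabelValues().Observe(common.ToMilliseconds(latency)).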

node/store_v2.go (resolved)